LE(age): Life expectancy at exact age \(x\) is defined as follows: The average number of remaining years of life expected by a hypothetical cohort of individuals alive at age \(x\) who would be subject during the remaining of their lives to the mortality rates of a given year. It is expressed as years.
We have downloaded life expectancy at exact age (for ages from 0 to 100 years) for different countries from the web page of United Nations, Department of Economic and Social Affairs, Extra Life Expectancy Division (2022). World Extra Life Expectancy Prospects 2022, Online Edition.
Other functions can be defined from LE(age):
Extra life expectancy: LE(age)-age. Annual increase in LE: LE(age)-LE(age-1), for age\(\ge 1\). Annual increase in LE, in logarithms: log( LE(age)-LE(age-1) ), for age\(\ge 1\). As an example, here we have the life expectancy data of Spain for year 2021:
Here you have the logarithm of annual increase in expected life length for year 2021, as functions of age for all the countries in the data set
#load data
load("/home/le/Dev/advanced_multivariate_analysis/lab/partial_exam/ExamOctober2023.Rdata")
head(LE)
## Age
## Country 0 1 2 3 4 5 6 7 8
## BI 61.6627 63.0588 62.2863 61.5243 60.7719 60.0247 59.2750 58.5125 57.7273
## KM 63.4174 65.4396 64.7592 63.9982 63.1776 62.3137 61.4187 60.5018 59.5698
## DJ 62.3049 63.7986 63.0696 62.3327 61.5869 60.8294 60.0560 59.2620 58.4437
## ER 66.5358 67.5069 66.6639 65.8203 64.9750 64.1257 63.2692 62.4015 61.5194
## ET 64.9748 66.2860 65.5365 64.7620 63.9660 63.1502 62.3152 61.4610 60.5876
## KE 61.4270 62.3108 61.6141 60.8061 59.9355 59.0282 58.0970 57.1547 56.2059
## Age
## Country 9 10 11 12 13 14 15 16 17
## BI 56.9138 56.0719 55.2062 54.3236 53.4315 52.5365 51.6443 50.7600 49.8869
## KM 58.6276 57.6790 56.7270 55.7743 54.8232 53.8760 52.9351 52.0023 51.0789
## DJ 57.6005 56.7343 55.8499 54.9530 54.0493 53.1444 52.2430 51.3493 50.4661
## ER 60.6219 59.7100 58.7866 57.8558 56.9214 55.9877 55.0584 54.1369 53.2256
## ET 59.6967 58.7907 57.8734 56.9489 56.0211 55.0942 54.1718 53.2571 52.3524
## KE 55.2515 54.2935 53.3337 52.3739 51.4155 50.4604 49.5108 48.5697 47.6370
## Age
## Country 18 19 20 21 22 23 24 25 26
## BI 49.0267 48.1789 47.3409 46.5097 45.6821 44.8559 44.0298 43.2031 42.3757
## KM 50.1648 49.2589 48.3588 47.4613 46.5644 45.6664 44.7669 43.8656 42.9630
## DJ 49.5948 48.7350 47.8846 47.0411 46.2016 45.3640 44.5268 43.6891 42.8505
## ER 52.3256 51.4363 50.5554 49.6800 48.8075 47.9359 47.0640 46.1914 45.3182
## ET 51.4589 50.5763 49.7032 48.8373 47.9758 47.1164 46.2573 45.3974 44.5365
## KE 46.7151 45.8045 44.9024 44.0091 43.1231 42.2458 41.3768 40.5158 39.6634
## Age
## Country 27 28 29 30 31 32 33 34 35
## BI 41.5478 40.7196 39.8916 39.0638 38.2363 37.4090 36.5815 35.7537 34.9257
## KM 42.0596 41.1557 40.2519 39.3487 38.4465 37.5459 36.6471 35.7509 34.8579
## DJ 42.0110 41.1711 40.3309 39.4910 38.6513 37.8117 36.9721 36.1323 35.2926
## ER 44.4444 43.5705 42.6964 41.8223 40.9479 40.0729 39.1969 38.3197 37.4414
## ET 43.6745 42.8118 41.9486 41.0852 40.2217 39.3577 38.4931 37.6278 36.7621
## KE 38.8191 37.9838 37.1561 36.3369 35.5263 34.7219 33.9267 33.1402 32.3639
## Age
## Country 36 37 38 39 40 41 42 43 44
## BI 34.0983 33.2722 32.4487 31.6290 30.8142 30.0044 29.1992 28.3975 27.5979
## KM 33.9687 33.0837 32.2033 31.3279 30.4578 29.5931 28.7341 27.8812 27.0347
## DJ 34.4538 33.6166 32.7824 31.9523 31.1275 30.3083 29.4942 28.6841 27.8768
## ER 36.5628 35.6847 34.8083 33.9353 33.0672 32.2045 31.3469 30.4934 29.6429
## ET 35.8966 35.0323 34.1703 33.3119 32.4583 31.6099 30.7663 29.9266 29.0898
## KE 31.5991 30.8432 30.0970 29.3592 28.6295 27.9087 27.1942 26.4862 25.7855
## Age
## Country 45 46 47 48 49 50 51 52 53
## BI 26.7998 26.0032 25.2093 24.4200 23.6380 22.8659 22.1051 21.3557 20.6162
## KM 26.1954 25.3639 24.5410 23.7274 22.9239 22.1309 21.3487 20.5772 19.8163
## DJ 27.0714 26.2681 25.4676 24.6720 23.8835 23.1043 22.3359 21.5782 20.8300
## ER 28.7948 27.9492 27.1072 26.2706 25.4416 24.6224 23.8146 23.0182 22.2323
## ET 28.2550 27.4224 26.5929 25.7683 24.9510 24.1430 23.3460 22.5599 21.7835
## KE 25.0924 24.4086 23.7329 23.0660 22.4086 21.7602 21.1228 20.4913 19.8679
## Age
## Country 54 55 56 57 58 59 60 61 62
## BI 19.8843 19.1579 18.4360 17.7191 17.0090 16.3094 15.6239 14.9552 14.3046
## KM 19.0657 18.3256 17.5962 16.8784 16.1731 15.4822 14.8070 14.1487 13.5076
## DJ 20.0893 19.3539 18.6229 17.8965 17.1766 16.4662 15.7687 15.0868 14.4221
## ER 21.4545 20.6828 19.9161 19.1544 18.3994 17.6541 16.9220 16.2060 15.5080
## ET 21.0147 20.2513 19.4922 18.7377 17.9895 17.2508 16.5253 15.8159 15.1242
## KE 19.2506 18.6399 18.0373 17.4414 16.8544 16.2786 15.7161 15.1721 14.6421
## Age
## Country 63 64 65 66 67 68 69 70 71
## BI 13.6714 13.0547 12.4532 11.8663 11.2936 10.7359 10.1946 9.6714 9.1672
## KM 12.8834 12.2757 11.6840 11.1083 10.5487 10.0057 9.4800 8.9730 8.4854
## DJ 13.7749 13.1447 12.5310 11.9332 11.3510 10.7846 10.2354 9.7050 9.1943
## ER 14.8283 14.1669 13.5234 12.8969 12.2863 11.6909 11.1110 10.5477 10.0015
## ET 14.4507 13.7950 13.1565 12.5344 11.9279 11.3371 10.7626 10.2057 9.6673
## KE 14.1301 13.6347 13.1572 12.6991 12.2436 11.7919 11.3450 10.9027 10.4712
## Age
## Country 72 73 74 75 76 77 78 79 80 81
## BI 8.6809 8.2112 7.7567 7.3161 6.8887 6.4748 6.0752 5.6923 5.3286 4.9859
## KM 8.0181 7.5710 7.1436 6.7349 6.3438 5.9689 5.6091 5.2641 4.9341 4.6195
## DJ 8.7026 8.2285 7.7697 7.3247 6.8927 6.4738 6.0691 5.6808 5.3116 4.9630
## ER 9.4733 8.9627 8.4680 7.9882 7.5228 7.0719 6.6359 6.2168 5.8170 5.4381
## ET 9.1475 8.6456 8.1600 7.6895 7.2334 6.7920 6.3662 5.9579 5.5689 5.2004
## KE 10.0433 9.6278 9.2145 8.8035 8.3939 7.9828 7.5684 7.1599 6.7482 6.3447
## Age
## Country 82 83 84 85 86 87 88 89 90 91
## BI 4.6643 4.3627 4.0834 3.8294 3.6001 3.3918 3.1976 3.0138 2.8401 2.6766
## KM 4.3209 4.0391 3.7798 3.5495 3.3487 3.1727 3.0121 2.8598 2.7145 2.5757
## DJ 4.6350 4.3266 4.0415 3.7842 3.5543 3.3477 3.1559 2.9741 2.8024 2.6408
## ER 5.0810 4.7458 4.4362 4.1567 3.9063 3.6802 3.4696 3.2685 3.0756 2.8908
## ET 4.8525 4.5244 4.2204 3.9458 3.7002 3.4793 3.2742 3.0796 2.8959 2.7232
## KE 5.9454 5.5795 5.2395 4.9383 4.6669 4.4251 4.1976 3.9825 3.7738 3.5886
## Age
## Country 92 93 94 95 96 97 98 99 100
## BI 2.5229 2.3786 2.2438 2.1184 2.0004 1.8856 1.7745 1.6741 1.5910
## KM 2.4431 2.3178 2.2012 2.0942 1.9971 1.9094 1.8314 1.7614 1.6961
## DJ 2.4892 2.3476 2.2163 2.0955 1.9856 1.8866 1.7979 1.7183 1.6467
## ER 2.7154 2.5515 2.4004 2.2609 2.1231 1.9951 1.8799 1.7718 1.6722
## ET 2.5615 2.4108 2.2713 2.1431 2.0266 1.9213 1.8261 1.7394 1.6601
## KE 3.3956 3.2221 3.0582 2.9074 2.7734 2.6494 2.5318 2.4156 2.3153
matplot(1:100,t(lgDifLE), type="l", col=8,
xlab="Age",ylab="log-Increase in life expectancy",
main="log-Increase in life expectancy, all countries, year 2021",
cex.main=.75,cex.lab=.75,cex.axis=.75)
i <- which(rownames(lgDifLE)=='ES')
lines(1:100,lgDifLE[i,],lwd=4, col=1)
legend("bottomright",CtryName[i],lwd=4,col=1,bty="n")
The file ExamOctober2023.Rdata contains the following
data:
LE, a matrix (countries in rows, age in columns) with
life expectancy at exact agelgDifLE, a matrix (countries in rows, age in columns)
with logarithms of the annual increase in life expectancy.CtryName, a vector with the country names.CtryISO2, a vector with the 2-digits ISO country
codes.Consider the joint distribution of x=scale(LE[,1]),
y=scale(lgDifLE[,15]), that is, the joint distribution of
the centered and scaled values of the life expectancy at birth and the
log-Increase in life expectancy at the age of 15.
Estimate the joint density of \((x,y)\) using a bivariate Gaussian kernel
estimator with equal bandwidths in both dimensions, h=a*c(1,1), with
a chosen by maximum log-likelihood leave-one-out
cross-validation in seq(0.15,.3,by=0.025).
Which is the optimal value of a? Give at least two
graphical representations of the estimated density function. When it is
possible, add the observed points to the graphics.
x <- scale(LE[,1])
y <-scale(lgDifLE[,15])
X <- cbind(x,y)
library(sm)
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
plot(X,col=8)
sm.density(X,h=0.15*c(1,1),display="slice",props=c(25,50,75,95),col=2,add=TRUE)
a_values <- seq(0.15,.3,by=0.025)
na <- length(a_values)
logLCVa <- numeric(length(a_values))
looCV_log__lik=list()
n <- dim(X)[1]
for (j in 1:na) {
a <- a_values[j]
for (i in 1:n){
new.point <- matrix(X[i,],ncol=2)
f.hat.i <- sm.density(X[-i,],h=a*c(1,1),display="none",eval.grid=FALSE,
eval.points=new.point)$estimate
logLCVa[j] <- logLCVa[j] + log(f.hat.i)
}
}
a_loo <- a_values[which.max(logLCVa)]
plot(a_values,logLCVa,type="b", main=paste("a_loo=",a_loo))
abline(v=a_loo,col=2)
best=a_values[which.max(logLCVa)]
print(paste("best b (LOOCV)",best))
## [1] "best b (LOOCV) 0.225"
plot(x,y,col="gray", asp=1)
plot(X,col=8)
sm.density(X,h=best*c(1,1),display="slice",col=2,add=TRUE)
sm.density(X, h=best*c(1,1), display="slice", add=TRUE, col="red", props=95)
sm.density(X, h=a_loo*c(1,1), display="persp")
Consider the previous bivariate data set \((x_i, y_i)\), \(i=1,\ldots,n\), where \(n\) is the number of countries.
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
library(sm)
library(fpc)
library(ggplot2)
library(cluster)
rng=2:6
X <- cbind(x,y)
GMM <- Mclust(as.matrix(X), parameters=TRUE)
clust.ind <- GMM$classification
summary(GMM,parameters=TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVV (ellipsoidal, equal volume) model with 2 components:
##
## log-likelihood n df BIC ICL
## -465.7109 236 10 -986.0601 -1093.341
##
## Clustering table:
## 1 2
## 126 110
##
## Mixing probabilities:
## 1 2
## 0.5432526 0.4567474
##
## Means:
## [,1] [,2]
## [1,] -0.5450809 0.6483159
## [2,] 0.6194237 -0.7367387
##
## Variances:
## [,,1]
## [,1] [,2]
## [1,] 0.9304610 -0.6846706
## [2,] -0.6846706 0.6022129
## [,,2]
## [,1] [,2]
## [1,] 0.3335657 -0.2567296
## [2,] -0.2567296 0.4720866
plot(GMM, what="BIC",asp=1)
plot(GMM, what="classification",asp=1)
plot(GMM, what="density",asp=1)
plot(GMM, what="uncertainty",asp=1)
summary(GMM,parameters=TRUE)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVV (ellipsoidal, equal volume) model with 2 components:
##
## log-likelihood n df BIC ICL
## -465.7109 236 10 -986.0601 -1093.341
##
## Clustering table:
## 1 2
## 126 110
##
## Mixing probabilities:
## 1 2
## 0.5432526 0.4567474
##
## Means:
## [,1] [,2]
## [1,] -0.5450809 0.6483159
## [2,] 0.6194237 -0.7367387
##
## Variances:
## [,,1]
## [,1] [,2]
## [1,] 0.9304610 -0.6846706
## [2,] -0.6846706 0.6022129
## [,,2]
## [,1] [,2]
## [1,] 0.3335657 -0.2567296
## [2,] -0.2567296 0.4720866
# the fitted model has 2 mixture components
# The fitted GMM by Mclust is EVV (ellipsoidal, equal volume)
# indicates that the two variance matrices have equal volume (proportional of the product of their eigenvalues)
# varying shape (different the ratio of eigenvalues)
# varying orientation (eigenvectors in both matrices ar not parallel)
# # Therefore, the variance matrices are different.
# As seen above the the components have different variance matrices
# Variances:
# [,,1]
# [,1] [,2]
# [1,] 0.7467912 -0.5213116
# [2,] -0.5213116 0.4474711
# [,,2]
# [,1] [,2]
# [1,] 0.3729693 -0.2977819
# [2,] -0.2977819 0.5147658
par(mfrow = c(1, 2))
plot(GMM, what="density", sub="Mclust density estimation", asp=1, col=2)
points(x,y, col = "grey")
plot(X, sub="GMM density estimation", col = "grey")
sm.density(X,h=best*c(1,1),display="slice",props=c(50,75,95),col=2,add=TRUE)
#
# The GMMM is unimodal, but the non-parametric density is 3-modal.
# Other characteristics are similar in both estimations.
plot(GMM, what="classification", asp=1, xlab="scale(LE[,1])",ylab="scale(lgDifLE[,15])")
points(X)
text(X[,1], X[,2], rownames(X), pos = 3, cex = .7)
# there are 2 clusters
#Some countries ate each cluster
# Blue cluster
Icl1 <- which(y<=-2)
cat("\n \n Blue cluster\n")
##
##
## Blue cluster
CtryName[Icl1]
## [1] "China, Hong Kong SAR" "Luxembourg" "Australia"
# Countries with high Life Expectancy at birth, larger than 82 years. So all the yearly increment of LE are small.
# For age=15 this increment are lower than 0.4 years for the selected countries.
# Red cluster
Icl2 <- which(x<=-2)
cat("\n \n Red cluster\n")
##
##
## Red cluster
cat("\n \n Red cluster\n")
##
##
## Red cluster
# Countries with low Life Expectancy at birth, lower than 57 years.
# So the yearly increment of LE are large, in particular for age=15.
# this increment are larger than 3.7 years for the selected countries except one (which has 2.2 years).
library(fpc)
epsilon <- 0.25
minPts <- 4
dbscan_result <- fpc::dbscan(X, eps = epsilon, MinPts = minPts, showplot = 0)
plot(dbscan_result, x, main = sprintf("DBSCAN eps=%.2f minPts=%d", epsilon, minPts), frame = FALSE, xlab="x",ylab="y")
# 4 clusters have been detected by dbcscan
dbscan_result
## dbscan Pts=236 MinPts=4 eps=0.25
## 0 1 2 3 4
## border 22 1 8 3 4
## seed 0 55 138 1 4
## total 22 56 146 4 8
# 22 outliers have been detected bz dbscan
op<-par(mfrow=c(1,2))
plot(GMM, what="classification",asp=1, main="GMM",
xlab="scale(LE[,1])",ylab="scale(lgDifLE[,15])")
plot(X, col=dbscan_result$cluster+1, pch=dbscan_result$cluster+1,
main="dbscan",
xlab="scale(LE[,1])",ylab="scale(lgDifLE[,15])",asp=1)
par(op)
table(dbscan_result$cluster,GMM$classification)
##
## 1 2
## 0 9 13
## 1 56 0
## 2 61 85
## 3 0 4
## 4 0 8
# There is no a perfect coincidence between both clustering methods
Consider the data matrix lgDifLE. Do dimensionality reduction of the columns of lgDifLE, from 100 dimensions to \(q=1\), in two different ways:
4.1 Using ISOMAP. Choose epsilon in seq(7, 12,by=1) as the value maximizing the correlation between the distance matrices in high and low dimensioal spaces.
Dlg <- dist(lgDifLE)
q<-1
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
epsilon <- seq(7, 12,by=1)
corr.dists <- array(0,dim=length(epsilon))
isomap.eps <- array(vector("list",1),dim=dim(corr.dists))
for (i in 1:length(epsilon)){
isomap.eps[[i]] <- isomap(as.matrix(Dlg), ndim=q, epsilon=epsilon[i])
D2.eps <- dist(isomap.eps[[i]]$points)
corr.dists[i] <- cor(Dlg,D2.eps)^2
#print(c(i,j,LC[i,j]))
}
i.max <- which.max(corr.dists)
eps.max <- epsilon[i.max]
isomap.max <- isomap.eps[[i.max]]
isomap.corr.dists.max <- corr.dists[i.max]
plot(epsilon,corr.dists, type="b")
abline(v=eps.max, col=2)
print(paste0("ISOMAP: epsilon.max=",eps.max,", corr.dists(epsilon.max)=",corr.dists[i.max]))
## [1] "ISOMAP: epsilon.max=9, corr.dists(epsilon.max)=0.970144362395919"
pairs(cbind(X[,1:2] ,isomap=isomap.max$points[,1]), pch=20, main="Best ISOMAP output in 1-dim")
plot(isomap.max$points,isomap.max$points)
text(isomap.max$points,isomap.max$points,CtryName,
pos=3-sign(isomap.max$points), cex=.75)
Select 3 countries located in the low dimensional space in such a way that they cover the variability of all the points in this dimension. Then plot their lgDifLE functions in a graphic as that in page 2.
# Select 3 countries located in the low dimensional space in such a way that they cover the variability of all the points in this dimension.
imin <- which.min(isomap.max$points)
imed <- which.min(abs(isomap.max$points -
median(isomap.max$points)))
imax <- which.max(isomap.max$points)
i3 <- c(imin,imed,imax)
matplot(1:100,t(lgDifLE), type="l", col=8,
xlab="Age",ylab="log-Increase in life expectancy",
main="log-Increase in life expectancy, all countries, year 2021",
cex.main=.75,cex.lab=.75,cex.axis=.75)
for (i in 1:3) lines(1:100,lgDifLE[i3[i],],lwd=4, col=i)
legend("bottomright",CtryName[i3],lwd=4,col=1:3,bty="n")
How do you interpret the dimension obtained with ISOMAP?
# The one-dimentional scores obtained by isomap sorts the countries from those having lower LE(age) and higher lgDifLE(age), to those having gihger LE(age) and lower lgDifLE(age)
4.2 USING t-SNE. Use theta=0 in Rtsne function and choose perplexity in seq(10,50,by=10) as the value maximizing the correlation between the distance matrices in high and low dimensioal spaces.
library("Rtsne")
D=dist(as.matrix(X))
set.seed(42)
theta= 0.0
perplexity <- seq(10,50,by=10)
q=1
corr.dists <- array(0,dim=length(perplexity))
tSNE.perp <- array(vector("list",1),dim=dim(corr.dists))
for (i in 1:length(perplexity)){
tSNE.perp[[i]] <- Rtsne(D, dims=q,
perplexity=perplexity[i],
theta=theta, num_threads = 1)
D2.perp <- dist(tSNE.perp[[i]]$Y)
corr.dists[i] <- cor(Dlg,D2.perp)^2
#print(c(i,j,LC[i,j]))
}
i.max <- which.max(corr.dists)
perp.max <- perplexity[i.max]
tSNE.max <- tSNE.perp[[i.max]]
tSNE.corr.dists.max <- corr.dists[i.max]
plot(perplexity,corr.dists, type="b")
abline(v=perp.max, col=2)
print(paste0("tSNE: perplexity.max=",perp.max,", corr.dists(perplexity.max)=",corr.dists[i.max]))
## [1] "tSNE: perplexity.max=50, corr.dists(perplexity.max)=0.849613763685878"
plot(tSNE.max$Y,tSNE.max$Y, main="t-SNE")
text(tSNE.max$Y,tSNE.max$Y,CtryName, #rownames(X)
pos=3-sign(tSNE.max$Y), cex=.75)
pairs(cbind(x=X[,1], y=X[,2] ,tsne=tSNE.max$Y[,1]), pch=20, main="Best t-SNE output in 1-dim")
- Select 3 countries located in the low dimensional space in such a way
that they cover the variability of all the points in this dimension.
Then plot their lgDifLE functions in a graphic as that in page 2.
# Select 3 countries located in the low dimensional space in such a way that they cover the variability of all the points in this dimension.
imin <- which.min(tSNE.max$Y)
imed <- which.min(abs(tSNE.max$Y -
median(tSNE.max$Y)))
imax <- which.max(tSNE.max$Y)
i3 <- c(imin,imed,imax)
matplot(1:100,t(lgDifLE), type="l", col=8,
xlab="Age",ylab="log-Increase in life expectancy",
main="log-Increase in life expectancy, all countries, year 2021",
cex.main=.75,cex.lab=.75,cex.axis=.75)
for (i in 1:3) lines(1:100,lgDifLE[i3[i],],lwd=4, col=i)
legend("bottomright",CtryName[i3],lwd=4,col=1:3,bty="n")
# The one-dimentional scores obtained by tSNE sorts the countries from those having lower LE(age) and higher lgDifLE(age), to those having higher LE(age) and lower lgDifLE(age)
4.3 Do a pairs plot (that is, a matrix of scatterplots) of the matrix having the following 4 columns:
# compare to PCA
# pca_le = princomp(lgDifLE)$scores[,1]
# plot(pca_le, pca_le, t='n')
# text(pca_le,pca_le, labels=rownames(X))
aux <- cbind(LE[,1], princomp(lgDifLE)$scores[,1],isomap.max$points,tSNE.max$Y)
colnames(aux) <- c("LE_birth", "PC.1", "Isomap", "tSNE")
pairs(aux)
# The one-dimentional scores obtained by the three methods (PC1, isomap and tSNE) sorts the countries approximately in the same order as LE at birth.
# PC1 is almost equialt to isomap.